Comparison of different strategies for utilizing two CHEMDNER corpora
Abstract
To identify chemical entities and drug names in patents, as defined by the CEMP subtask of the CHEMDNER patent task, we use machine learning to build a chemical named entity recognition (CNER) system. A machine-learning-based CNER system benefits from a large number of training examples. Two CHEMDNER corpora have been developed: one for the patent task, and the CHEMDNER corpus of PubMed abstracts constructed for the CHEMDNER task in BioCreative IV. Both corpora were built according to very similar guidelines, but the writing style of the texts differs. In this paper, we discuss different strategies for using these two corpora to identify chemical entities in patents. Our basic system uses a conditional random field (CRF) as its machine learning technique, with linguistic features plus a domain-knowledge feature produced by ChemSpot. We compare the results of these strategies using simple system performance measures (e.g., recall, precision, and F-score) and an analysis of the unique findings of each system.
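The two kinds of comparison named in the abstract (precision, recall, and F-score, plus the findings unique to each system) can be illustrated with a minimal sketch. It assumes exact-match scoring over mention spans represented as (document id, start offset, end offset) tuples; that representation, the function names, and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Iterable, NamedTuple, Set, Tuple

# Assumed mention layout: (document id, start offset, end offset).
Mention = Tuple[str, int, int]


class Scores(NamedTuple):
    precision: float
    recall: float
    f_score: float


def evaluate(predicted: Set[Mention], gold: Set[Mention]) -> Scores:
    """Exact-match precision, recall, and balanced F-score."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f_score)


def unique_correct(system: Set[Mention],
                   others: Iterable[Set[Mention]],
                   gold: Set[Mention]) -> Set[Mention]:
    """Gold mentions found by `system` and by none of the other systems."""
    found = system & gold
    for other in others:
        found = found - other
    return found


# Toy data (hypothetical): four gold mentions and two system outputs,
# e.g. one system per corpus-combination strategy.
gold = {("d1", 0, 7), ("d1", 10, 17), ("d2", 3, 9), ("d2", 20, 28)}
system_a = {("d1", 0, 7), ("d1", 10, 17), ("d2", 30, 35)}
system_b = {("d1", 0, 7), ("d2", 3, 9)}

scores_a = evaluate(system_a, gold)
scores_b = evaluate(system_b, gold)
only_a = unique_correct(system_a, [system_b], gold)
only_b = unique_correct(system_b, [system_a], gold)
```

Two systems with similar F-scores can still differ in `only_a` and `only_b`, which is why looking at unique findings complements the aggregate measures.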
Similar resources
The CHEMDNER corpus of chemicals and drugs and its annotation principles
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in...
A Genre Analysis of Reprint Request E-mails Written by EFL and Physics Professionals
The present study aimed to analyze reprint request e-mail messages written by postgraduates (MA students) of two fields of study, namely Physics and EFL, to realize the differences and similarities between the two email types. To investigate the purpose of the study, a sample of 100 e-mail messages, 50 Physics and 50 EFL, were analyzed according to Swales’ (1990) model for reprint requests and ...
An Investigation of the Relationship between Gender and Different Strategies of Expressing Request in English and Persian Films
The main objective of the present study is to elaborate the contrasts between males and females in their use of different strategies of request in English and Persian and ascertain the degree to which independent variables like gender and language affect the application of these strategies during informal communication. Furthermore, it offers comparable corpora which provide a good basis for cro...
Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization
BACKGROUND The functions of chemical compounds and drugs that affect biological processes and their particular effect on the onset and treatment of diseases have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literatures on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound a...
A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature
BACKGROUND Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity r...
Publication date: 2015